
[RFC] Wire contract + mechanical wire-compat CI gate (complement to #1920) #2

Closed
silas-scitix wants to merge 1 commit into main from feat/wire-compat-rfc


@silas-scitix silas-scitix commented Jun 23, 2026

Collaborator

Summary

Draft RFC + a working golden type-code CI gate that mechanically detects RPC wire breaks before merge. Positioned as a complement to kvcache-ai#1920 (Rolling Upgrade Support): kvcache-ai#1920 owns the rolling-upgrade policy/features; this RFC owns the contract definition and its enforcement -- the part kvcache-ai#1920 explicitly leaves to "code review".

Why this matters in production (and why I want to land it)

The Mooncake Store master is on the hot path for every prefill: thousands of per-rank clients depend on it for KV-cache metadata. A wire break there is not a localized bug -- it is a fleet-wide outage with a specific, expensive failure mode:

  • No rolling upgrade, no canary. A shifted RPC type-code makes the peer hard-reject the call (errc::invalid_rpc_arguments), so an upgraded client talking to a not-yet-upgraded master (or vice versa) simply fails. "Upgrade 10% and watch" is impossible; master and every client must switch in one atomic step.
  • The upgrade window costs the cache itself. A stop-the-world switch makes the prefix cache unavailable across the cutover, forcing prefill recompute and a TTFT spike when traffic resumes. Blue-green avoids the downtime but needs ~2x resources for the window and starts cold (L2/L3 re-warm). Either way you pay in latency exactly when the cluster is busiest.
  • It can recur silently. RPC identity is computed implicitly by the compiler (function_id = MD5Hash32(name), arg_type_code = MD5Hash32(layout) & 0xFFFFFFFE), derived from source layout, not from any declared version. An additive change can shift the wire while compiling cleanly and passing review -- so "enforce via code review" is not enforcement.
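The silent-drift point above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `hash32` is a stand-in (the first four bytes of an MD5 digest), not ylt's actual `MD5Hash32`, and the "layout" is modelled as a comma-joined list of argument type names, which is not how struct_pack really encodes a layout. The argument lists are hypothetical. The sketch only shows the shape of the problem: the function-id depends on the name alone, so appending an argument leaves it unchanged while the argument type-code moves.

```python
import hashlib
from typing import Sequence


def hash32(text: str) -> int:
    """Stand-in 32-bit hash: first 4 bytes of MD5. NOT ylt's real MD5Hash32."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


def function_id(name: str) -> int:
    # Identity of the RPC depends only on the handler name.
    return hash32(name)


def arg_type_code(arg_types: Sequence[str]) -> int:
    # Hypothetical layout string; the low bit is cleared as in the formula above.
    layout = ",".join(arg_types)
    return hash32(layout) & 0xFFFFFFFE


# Hypothetical argument lists: the second appends one bare trailing argument.
old_args = ["std::string", "uint64_t"]
new_args = old_args + ["std::string"]

fid_before = function_id("PutStart")
fid_after = function_id("PutStart")
code_before = arg_type_code(old_args)
code_after = arg_type_code(new_args)

print(f"function_id   0x{fid_before:08x} -> 0x{fid_after:08x}")
print(f"arg_type_code 0x{code_before:08x} -> 0x{code_after:08x}")
```

The source compiles either way, which is why nothing short of comparing the computed codes against a recorded baseline notices the change.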

A concrete upcoming case we are watching

This is not hypothetical for us. #2288 ("Propagate tenant identity through object RPCs") appends a bare trailing tenant_id to ~30 WrappedMasterService handlers, which shifts their struct_pack argument type-codes (e.g. PutStart 0xfad0c534 -> 0x22f8edba). As soon as that lands in a release, a v0.3.11 peer and the new master hard-reject each other's affected RPCs -- i.e. it breaks rolling upgrade across that boundary. It is a live case: the change is headed for the next version our production fleet will run, and it is exactly the class of change a mechanical gate is meant to surface intentionally, as a reviewable red check, rather than let ship silently.

The value is concrete: a frozen wire contract + a pre-merge gate turns "did we break the wire?" from a production incident into a red CI check, and the runtime version-digest turns silent same-version drift between deployed peers into an observable signal. That is what makes rolling upgrades actually safe to run -- the whole promise of kvcache-ai#1920. I am strongly motivated to get this compatibility design landed in production, not just discussed, and happy to do the work to make it production-ready: maintain the golden contract, wire the gate as a required check, build out the cross-version interop matrix, and adapt to whatever shape the maintainers prefer. I can also help land the forward-compatible serialization pieces of kvcache-ai#1920's Phase 1 on top of this enforcement layer.

What's in this PR (docs + gate tool only, NO production code changed)

  • WIRE-COMPAT-RFC.md -- the proposal: frozen wire contract, pre-merge CI gate, version-gate redesign, evolution rules, interop methodology.
  • gate/ -- a small generator + checker that emits every handler's function_id + arg_type_code (+ wire structs) into a checked-in golden file and fails on any unintended drift. Demonstrated: clean tree = green (59 handlers); inject a bare-trailing-arg of the #2288 shape on PutEnd = red, naming the handler with old->new code (0x61232de0 -> 0x1b7a228a, function-id unchanged); revert = green again. Built against vendored ylt 0.6.0.
  • ci/wire-compat-gate.yml -- GitHub Actions wiring the checker as a required check on PRs touching the RPC surface.
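As a rough illustration of the checker half of gate/, the sketch below compares a freshly generated handler table against the golden one and reports drift in the old->new form described above. It is not the tool in this PR: the function name `check_against_golden`, the dictionary shape, and the PutEnd function-id value `0x11111111` are placeholders assumed for the example. Only the two PutEnd argument type-codes are taken from the demonstration above.

```python
def check_against_golden(golden: dict, current: dict) -> list:
    """Return drift messages; an empty list means the gate is green."""
    problems = []
    for handler in sorted(set(golden) | set(current)):
        old, new = golden.get(handler), current.get(handler)
        if old is None:
            problems.append(f"{handler}: handler not present in golden file")
        elif new is None:
            problems.append(f"{handler}: handler removed from RPC surface")
        else:
            for field in ("function_id", "arg_type_code"):
                if old[field] != new[field]:
                    problems.append(
                        f"{handler}: {field} 0x{old[field]:08x} -> 0x{new[field]:08x}"
                    )
    return problems


# function_id is a placeholder value; the arg type-codes are the PutEnd codes quoted above.
golden = {"PutEnd": {"function_id": 0x11111111, "arg_type_code": 0x61232DE0}}
drifted = {"PutEnd": {"function_id": 0x11111111, "arg_type_code": 0x1B7A228A}}

for line in check_against_golden(golden, drifted):
    print(line)  # prints "PutEnd: arg_type_code 0x61232de0 -> 0x1b7a228a"
```

A CI job would exit non-zero whenever the returned list is non-empty, and an intentional wire change would be made by regenerating the golden file in the same PR so the diff is visible to reviewers.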

Version-gate redesign (Section 4)

Layers on kvcache-ai#1920's proposed structured ServiceReady: adds a wire-contract digest derived from code (not a hand-edited constant), so same-version wire drift becomes observable instead of silent. The digest is the runtime mirror of the CI gate -- the gate prevents drift pre-merge, the digest surfaces drift between already-deployed peers.
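One way to picture a digest "derived from code" is a hash over the same handler table the gate records. The sketch below is an assumption, not the design in Section 4 of the RFC: the choice of SHA-256, the canonical line format, the tuple shape, and the numeric values are all made up for illustration.

```python
import hashlib


def wire_contract_digest(contract: dict) -> str:
    """Hash a {handler: (function_id, arg_type_code)} table in a canonical order."""
    h = hashlib.sha256()
    for handler in sorted(contract):  # sort so insertion order cannot change the digest
        fid, code = contract[handler]
        h.update(f"{handler}:{fid:08x}:{code:08x}\n".encode("utf-8"))
    return h.hexdigest()


# Hypothetical tables for two peers that report the same release version.
peer_a = {"PutStart": (0x0000AAAA, 0x0000BBB0), "PutEnd": (0x0000CCCC, 0x0000DDD0)}
peer_b = {"PutEnd": (0x0000CCCC, 0x0000DDD2), "PutStart": (0x0000AAAA, 0x0000BBB0)}

if wire_contract_digest(peer_a) != wire_contract_digest(peer_b):
    print("wire-contract digest mismatch between peers")
```

Because the digest is computed from the table rather than typed in by hand, two builds with the same version string but different RPC layouts would report different digests at handshake time.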

Note

Internal sample PR (fork-only) for review before any upstream proposal.

…vcache-ai#1920)

Adds a draft RFC plus a working golden type-code gate that detects RPC
wire breaks before merge. Complements kvcache-ai#1920 (Rolling Upgrade) by supplying
the enforcement kvcache-ai#1920 leaves to code review: a machine-checkable frozen
wire contract (function-ids + arg-tuple type-codes + wire structs), a
pre-merge CI gate, a version-gate redesign, and a cross-version interop
test methodology. Motivated by the kvcache-ai#2288 silent wire break (a bare
trailing tenant_id shifted ~30 handler arg type-codes; nothing in CI
caught it). Docs + gate tool only, no production code changed.

github-actions bot added the documentation and run-ci labels Jun 23, 2026
@silas-scitix

Collaborator Author

Superseded by the upstream PR kvcache-ai#2579.
